Text File | 1992-06-29 | 33KB | 824 lines

═══════════════════════════════════════
5.  PATTERNS IN BYTE SEQUENCES
═══════════════════════════════════════

Topic 4 showed how byte distributions help us to analyze the
content of a file. One simple fact may be obscured by the length
of Topic 4... that a byte survey and analysis take very little
time. The survey itself might require one to four seconds.
Reviewing it might involve another thirty seconds.

In Topic 5 we consider sequences of bytes. We want to identify
patterns related to our objectives of:

     »  extracting searchable content;
     »  recognizing record separations;
     »  recognizing field separations; and
     »  recognizing formatting aids.

═══════════════════════════════════════
5.1  Heads and tails... first impressions of a file
═══════════════════════════════════════

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
 Usage     head file_name [ line_count ] [/a][/t] > text

           Displays in printable format the first line_count
           lines within a file; the default is 10 lines. This
           clone of the Unix HEAD and TAIL utilities provides a
           quick check on the likely contents of a file. If the
           "/a" option is used, accented characters are treated
           as printable text. If "/t" is specified, the display
           is of the TAIL of the file, the LAST line_count lines.

 input:    Normally an ASCII text file.

 output:   The specified number of lines is either displayed on
           the screen or sent to a file. Each non-printable
           character is replaced by an ^ symbol. If any line
           length exceeds 120 characters, a warning is issued.
           If any line length exceeds 1024 or the file includes
           null bytes, the program advises that the target file
           is not ASCII text.

 writeup:  MIR TUTORIAL ONE, topic 5
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

The HEAD program can be used to get a first impression of the
beginning and of the end of most files, although it is best for
ASCII text. Try for example:

     HEAD SVP_TXT

The result shows directly on the screen. Alternately it may be
redirected into a new file.
With the command "HEAD SVP_TXT", ten lines are shown. Next try

     HEAD SVP_TXT 4

and then

     HEAD SVP_TXT 20

You are shown 4 lines and the next time 20 lines of text. For
line counts greater than 23, be ready to use CTL-S to stop and
restart movement across the screen.

Adding the argument "/t" switches heads to tails.

     HEAD SVP_TXT /T

causes the last 10 lines of SVP_TXT to be shown on the screen.

     HEAD SVP_TXT 25 /T > TEMP
or
     HEAD SVP_TXT /T 25 > TEMP

places the last 25 lines in a file called TEMP. Note the file
name must come first; the order of arguments after that does not
matter. Incidentally, there is no restriction on the number of
lines. I tried HEAD SVP_TXT /T 4000 and found it worked!

≡≡≡≡->> QUESTION: Input the DOS command "COPY HEAD.C HEAD2.C".
        Then revise HEAD2.C so that no file is named, and
        standard input is the source of data. Compile the result
        and experiment with it. The arguments are simpler, and
        their order doesn't matter. What are the dangers of using
        HEAD2.C in a DOS environment? <<-≡≡≡≡

≡≡≡≡->> QUESTION: Make another copy of HEAD.C and call it TAIL.C.
        Edit it so that the resulting program needs no "/t"
        argument and always shows the end of a file. Experiment.
        <<-≡≡≡≡

Occasionally you might have text containing legitimate accented
characters. To demonstrate the "/a" (accents) argument:

     HEAD SVP_TXT 150 /A > TEMP
then
     HEAD TEMP /T
then
     HEAD TEMP /T /A

What's really happening here? You are taking the top 150 lines of
a file, storing it in a separate temporary file, then displaying
the last 10 lines of the temporary file (that is, lines 141 to
150 of the original file) on the screen. This is a way of showing
an intermediate part of a file... not as fast as CPB (copy
bytes), but convenient.

When you try the last two commands above, do you notice the
difference between the two displays? HEAD TEMP /T includes a word
that looks like H^tel; when accents are requested in
HEAD TEMP /T /A, the same word comes out as Hôtel.
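
The H^tel / Hôtel difference can be pictured with a small C
sketch. This is not the code in HEAD.C; the names is_shown and
mask_line are hypothetical, and the accent range 0x80 to 0xA5
(where PC code page 437 keeps its accented letters) is an
assumption about what "/a" accepts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from HEAD.C: decide whether a byte
 * is displayed as-is.  Plain printable ASCII is 0x20..0x7E.  With
 * accents switched on we also accept 0x80..0xA5, an ASSUMED range
 * covering the accented letters of PC code page 437. */
static int is_shown(unsigned char c, int accents)
{
    if (c >= 0x20 && c <= 0x7E)
        return 1;
    return accents && c >= 0x80 && c <= 0xA5;
}

/* Copy n bytes of one line from src to dst, replacing every byte
 * that is not shown with a caret.  dst must hold n + 1 bytes. */
static void mask_line(const unsigned char *src, size_t n,
                      char *dst, int accents)
{
    size_t i;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < n; i++)
        dst[i] = is_shown(src[i], accents) ? (char)src[i] : '^';
    dst[n] = '\0';
}
```

Called with accents set to 0, the byte hex 93 (ô) in "Hôtel" is
masked to a caret; with accents set to 1 it passes through
unchanged.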

≡≡≡≡->> QUESTION: The experiment fails if you build the temporary
        file without the /A argument (HEAD SVP_TXT 150 > TEMP).
        Why does it fail? <<-≡≡≡≡

On a non-text file, HEAD may either show a lot of caret ("^")
characters, or conclude that a HEAD display is meaningless. That
information is worth the few seconds used to input the command
and see the result.

═════════════════════════
5.2  Non-DOS files
═════════════════════════

Suppose you display the head of a file and find it looks like
this:

Fourscore and seven years ago, our
                                  forefathers brought forth upon
this continent a new nation,
                            conceived in liberty, and dedicated to
the proposition that all men are
                                created equal.

This sample is not 80 characters wide, but you get the idea. Each
new line starts where the last one left off, and lines wrap
around onto the next line when the right margin is reached. This
effect is common when UNIX files are brought into a DOS
environment. DOS needs a carriage return to match each linefeed
character. Here's a simple solution:

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
 usage:    dosify file_name[s]

           Replaces a UNIX-style file with a copy in which each
           line feed is preceded by one carriage return, and the
           file ends with one CTL-Z byte. Use this program on a
           file in which the MORE command produces a skewed
           listing that fails to go back to the left margin for
           new lines.

 input:    Any printable ASCII file[s].

 output:   The same file, with the same name, with DOS
           conventional line ends and end of file.

 writeup:  MIR TUTORIAL ONE, topic 5
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

You can dosify a clutch of files in one command:

     DOSIFY GETTYSBU.RG LINCOLN DOUGLAS WHATEVER

and the display begins to make sense:

Fourscore and seven years ago, our
forefathers brought forth upon
this continent a new nation,
conceived in liberty, and dedicated to
the proposition that all men are
created equal.

DOSIFY appears to change files in place.
In reality it makes a copy, and if successful, destroys the
original and changes the name of the copy to match the original.

≡≡≡≡->> QUESTION: Using A_BYTES on a non-DOS file, how would you
        calculate in advance the number of bytes that it will
        contain after it is dosified? <<-≡≡≡≡

═════════════════════════════════════
5.3  Displaying printable data
═════════════════════════════════════

Our immediate objective is to get first impressions of the
content of a file. F_PRINT is a filter to show only printable
characters within a file. Unlike HEAD, it can start instantly at
any point within a file. An accent argument extends the range to
include accented (high-bit-set) characters, but not graphics.

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
 usage     f_print file_name [/a][/w] [ from_byte to_byte ] > subset

           Reduces a file to printable characters only. If the
           /w option is specified, strings of printable
           characters that are unlikely to be words are filtered
           out as well, and each new burst of accepted text is
           placed on a new line. /a causes accented characters
           to be accepted as printable.

 input:    Any file whatsoever, or any part of a file.

 output:   Printable subset.

 writeup:  MIR TUTORIAL ONE, topic 5
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

The command

     F_PRINT SVP_TXT /A 121500 121700

causes the following display. Note the accented byte in the two
repetitions of Luçon.

     CENT <P10>D<P8>EPAUL<R>i.s.C.M.
     @TEXT60 = <MI>Addressed:<D> Monsieur Le Soudier, Priest of
     the Mission of Luçon, in Luçon
     @HEAD4 = 464. - TO N.
     @TEXT4 = Saint-Lazare, Sunday, July 29, 1640
     @TEXT7

For some fun, try F_PRINT on an executable EXE file, first
without the /W argument, then with /W. For example,

     F_PRINT F_PRINT.EXE
and
     F_PRINT /W F_PRINT.EXE

The second listing is much shorter and far more intelligible.

≡≡≡≡->> QUESTION: In what ways might you amend source code in
        F_PRINT.C to get other useful effects? Hint: Try
        variations in the function check_store.
        <<-≡≡≡≡

═══════════════════════════════
5.4  Detailed data dumps
═══════════════════════════════

Let's move beyond first impressions to methods of displaying
exactly what byte sequences occur in a file or part of a file.

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
 Usage     dump file_name [/a] [ from_byte [ to_byte ] ] > report

           Lists the contents of a specified portion of any
           file, reporting 16 bytes per line. "/a" causes
           accented high bit characters to be printed.

 input:    Any file whatsoever.

 output:   Printable ASCII report, listing offset, then 16 bytes
           in hexadecimal format, with printable ASCII on the
           right; periods substitute for non-printable bytes.

 writeup:  MIR TUTORIAL ONE, topic 5
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

DUMP permits commands such as:

     DUMP SVP_TXT 0 800 > TOP_END

This preliminary test of the file can be tried on any portion of
the file. Moreover, if your target has 367 or fewer bytes, you
can send the output directly to the screen without worrying about
CTL-S stop and go control, as in:

     DUMP SVP_TXT 100000 100366

DUMP restricts the printable character set to the 95 byte
patterns ranging from hex 20 (space) up to hex 7E. This
restriction makes it much easier to recognize ordinary text; it
is not surrounded by a jumble of happy faces and graphic
characters. (Try the DOS 5.0 MORE command on any EXE executable
file and see what you get!) Other characters are in the strict
sense printable... carriage returns, line feeds and tabs.

For accented characters using PC compatible extended ASCII, add
the accents argument "/a":

     DUMP SVP_TXT /A 11100 11200

Note the accented French word "écus" in the result.

═══════════════════════════════════════════
5.5  Convenient display of fragments
═══════════════════════════════════════════

Suppose we want to check out other high-bit-set bytes found in
the file SVP_TXT.
Here is the list created by A_BYTES:

 é [82]  43  0.0%  11144 12314 13915 14831 18658 23503 23800 26370
 â [83]   1  0.0%  207322
 à [85]   1  0.0%  116508
 ç [87]   7  0.0%  95180 109218 121610 121620 129909 175862 181966
 è [8A]   4  0.0%  130386 130571 161305 232659
 î [8C]   4  0.0%  65079 93876 95582 138200
 ô [93]  10  0.0%  8834 16736 28121 28656 97731 134953 163316 170678

One way to display a byte at a known location with its context is
to issue a DUMP command that straddles its location. For example,
to view the ç with cedilla at offset 95180:

     DUMP SVP_TXT /A 95100 95300

would do the job. But DUMP gives too much detail for this
purpose. The key lines in the screen display are:

95164: 79 65 3c 5e 3e 39 3c 44 3e 20 69 6e 0d 0a 4c 75  ye<^>9<D> in Lu
95180: 87 6f 6e 3f 3c 5e 3e 31 30 3c 44 3e 20 49 20 66  çon?<^>10<D> I f
95196: 69 6e 64 20 69 74 20 64 69 66 66 69 63 75 6c 74  ind it difficult

A more convenient program is FRAGMENT:

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
 Usage     fragment input_file offset > stdout

           Display a five line fragment of a file in printable
           form with two lines of context on either side of the
           selected offset. Useful to get a quick view of
           contents at a selected location in a file. Use CPB
           and/or DUMP for an alternate method, less convenient,
           but with more detail.

 input:    Most useful for printable ASCII files.

 output:   Five double spaced lines in which non-printing
           characters are shown as blank with a ^ in the blank
           line below. The character at the exact offset is
           marked by a | in the blank line below.

 writeup:  MIR TUTORIAL ONE, topic 5
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

To display context of the first ç at offset 95180, simply input:

     FRAGMENT SVP_TXT 95180

Five lines are displayed; the third line starts this way:

     Luçon?<^>10<D> I find it difficult to
       |

Notice how the ç is highlighted by the vertical bar | immediately
underneath.
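
The two-row display just described (text above, markers below)
can be sketched in C. This is an illustration only, not the code
in FRAGMENT.C: the function name mark_line is hypothetical, and
the choice to show every high-bit byte as-is is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch, not FRAGMENT.C: build the two display rows
 * for one line of input.  text[] receives the printable form, with
 * non-printing bytes shown as blanks; marks[] receives '^' under
 * each blanked byte and '|' under the byte at index target.  Pass
 * target >= n when the target byte is not on this line.  Both
 * output buffers must hold n + 1 bytes.
 * ASSUMPTION: bytes with the high bit set are displayed as-is. */
static void mark_line(const unsigned char *src, size_t n,
                      size_t target, char *text, char *marks)
{
    size_t i;

    assert(src != NULL && text != NULL && marks != NULL);
    for (i = 0; i < n; i++) {
        int plain = (src[i] >= 0x20 && src[i] <= 0x7E)
                    || src[i] >= 0x80;

        text[i]  = plain ? (char)src[i] : ' ';
        marks[i] = plain ? ' ' : '^';
        if (i == target)
            marks[i] = '|';    /* the requested offset wins */
    }
    text[n]  = '\0';
    marks[n] = '\0';
}
```

For a line that begins at file offset 95178, the ç at offset
95180 is index 2 of the line, so the bar lands in the third
column of the marker row.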

Try showing the î at byte 138200 with this command:

     FRAGMENT SVP_TXT 138200

You are shown the context around:

     M. Benoît<^>2<D> does not return
            |

and again a vertical bar underneath draws attention to the î.

FRAGMENT works for any ASCII file, particularly where line
lengths are under 80 characters. (Unix users: Many terminals are
unpredictable when they attempt to display bytes with the high
bit set. The source code contains notes indicating where to make
necessary changes.)

In the SVP_TXT example, the locations shown above all check out
as valid accented characters within French names. Further along,
we will find that by using the program A_PATTRN we can verify
that all bytes with high bit set in our sample SVP_TXT are valid.

══════════════════════════════════════════════
5.6  Viewing patterns throughout a file
══════════════════════════════════════════════

The techniques thus far display context at specific points within
a file... the beginning, the end, or near certain offsets. More
is needed. We want to be able to:

     »  ensure that patterns are consistent across all the data;
     »  identify every set of codes and signals that may help us
        toward our objectives of interpreting record and field
        separators, searchable content, etc.

At the end of the preceding topic, we concluded that our sample
file, SVP_TXT, is extended ASCII text (normal text plus accented
letters), and that the usage of certain characters needs to be
checked out: @ = ^ | < >.

The program A_PATTRN can be used to isolate every occurrence
within a file of a single character or of a string of up to 16
characters.

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
 usage:    a_pattrn file_name key [ /x ] [ bytes_before ] > report
           "/x" = include hex, show only 16 bytes instead of 40

           List every occurrence of a key character or string in
           a file. Show 3 (or "bytes_before", range 0 to 15)
           bytes prior to the key each time.
           Normally show a total of 40 bytes each time the key
           is found; if the "/x" argument is set, show only 16
           bytes, but in hex and ASCII both. The key may be from
           1 to 16 characters. Within the key, any non-printing
           characters, characters which may confuse DOS (> or <
           or |), linefeeds, blanks, backslash, etc. must be
           shown in hex form... a backslash and 2 hex digits.
           Examples:

               a_pattrn herfile \8E > herfile.8e
               a_pattrn yourfile * 7 > yourfile.ast
               a_pattrn myfile Mother
               a_pattrn hisfile \94\05ke\ff 0 > 5char.pat

 input:    Any file whatsoever.

 output:   One line for each occurrence of the target byte(s) in
           the file. Sort the result to make patterns show up
           more clearly.

 writeup:  MIR TUTORIAL ONE, topic 5
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

DOS assigns meaning to certain characters (such as space, | < >,
etc.), so if you have any problem using the A_PATTRN command,
switch to the hex format for the search key (the letter or
sequence of letters on which you wish to search). For example,

     A_PATTRN SVP_TXT > > SVP3E

fails, but

     A_PATTRN SVP_TXT \3E > SVP.3E

works.

The key benefit of A_PATTRN is that it selects the same byte or
string and places it in the same position on each line. Patterns
begin to emerge at once. Here are the first few lines produced
by:

     A_PATTRN SVP_TXT \3C\5E > RESULT

00000641:    men<^>2<D> want to communicate..in writi
00001138:    ot]<^>3<D> concern the hospital, you..ca
00001467:    uf,<^>4<D> on Madame..Goussault's<^>5<D>
00001497:    t's<^>5<D> estates, I believe, although
00001678:    eu,<^>6<D> and to do so as soon as possi
00002197:    ne,<^>2<D> which I see from your..letter
00003152:    is;<^>3<D> who put together his council
00003412:    us?<^>4<D> Indeed, there is no good or..
00003747:    ey,<^>5<D> who on another..occasion spok
00005086:    eva<^>6<D> and,..following his example,

\3C\5E appears as <^ starting at byte 17 in each line.
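
The fixed-column effect can be sketched in C. This is not the
source of A_PATTRN; find_key and format_hit are hypothetical
names, and the layout (eight-digit offset of the key, a colon,
four blanks, then 40 bytes with periods for non-printable or
missing bytes) is inferred from the sample listings in this
topic rather than taken from the program.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SHOWN 40    /* bytes shown per hit in the non-hex report */

/* Return the offset of the next occurrence of key at or after
 * 'from', or -1 when there is none.  A plain linear scan. */
static long find_key(const unsigned char *buf, long len,
                     const unsigned char *key, long klen, long from)
{
    long i;

    assert(buf != NULL && key != NULL && klen > 0 && from >= 0);
    for (i = from; i + klen <= len; i++)
        if (memcmp(buf + i, key, (size_t)klen) == 0)
            return i;
    return -1;
}

/* Format one report line for a hit at offset pos, showing 'before'
 * bytes of context ahead of the key.  Bytes outside the buffer and
 * non-printable bytes both appear as periods, so the key always
 * sits in the same column.  out must hold at least 64 bytes.
 * ASSUMPTION: layout inferred from the sample output only. */
static void format_hit(const unsigned char *buf, long len,
                       long pos, int before, char *out)
{
    long start = pos - before;
    int n = sprintf(out, "%08ld:    ", pos);
    int i;

    for (i = 0; i < SHOWN; i++) {
        long at = start + i;
        unsigned char c = (at >= 0 && at < len) ? buf[at] : 0;

        out[n++] = (c >= 0x20 && c <= 0x7E) ? (char)c : '.';
    }
    out[n] = '\0';
}
```

With the default of three bytes of context, the key begins at
column 17 of every line produced by format_hit, which is what
makes a later SORT /+17 meaningful.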

It is a simple matter to perform a sort:

     SORT /+17 < RESULT > RESULT.SRT

The hexadecimal output from A_PATTRN (when argument /x is used)
looks just like the output of DUMP. Here we have shortened the
lines a bit.

00000641: 6d 65 6e 3c 5e 3e 32 3c 44 3e 20...  men<^>2<D> want
00001138: 6f 74 5d 3c 5e 3e 33 3c 44 3e 20...  ot]<^>3<D> conce
00001467: 75 66 2c 3c 5e 3e 34 3c 44 3e 20...  uf,<^>4<D> on Ma
00001497: 74 27 73 3c 5e 3e 35 3c 44 3e 20...  t's<^>5<D> estat
00001678: 65 75 2c 3c 5e 3e 36 3c 44 3e 20...  eu,<^>6<D> and t
00002197: 6e 65 2c 3c 5e 3e 32 3c 44 3e 20...  ne,<^>2<D> which
00003152: 69 73 3b 3c 5e 3e 33 3c 44 3e 20...  is;<^>3<D> who p
00003412: 75 73 3f 3c 5e 3e 34 3c 44 3e 20...  us?<^>4<D> Indee
00003747: 65 79 2c 3c 5e 3e 35 3c 44 3e 20...  ey,<^>5<D> who o
00005086: 65 76 61 3c 5e 3e 36 3c 44 3e 20...  eva<^>6<D> and,.

The hex result can also be sorted (SORT /+20). When dealing with
fully printable files, the hex rendition of each byte is not
particularly useful. The one piece of information the hex version
provides is that the ".." pattern within the ASCII is usually a
carriage return - line feed combination (hex 0D 0A).

Whichever output is selected, we discover that the two bytes "<^"
in the file SVP_TXT are in every case followed by ">", a one or
two digit number, then "<D>". The lowest numbers, 1 and 2, are
most frequent. The frequency falls off steadily so that the
highest, "<^>27<D>", occurs only once.

Looking at the patterns around the single character hex 5E ("^")
alone reveals two other combinations: "<B^>#<D>" and "<I^>#<D>".
The three basic patterns <^>, <B^> and <I^> account for all
occurrences of the caret character "^".

≡≡≡≡->> QUESTION: DOS 5.0 has a "FIND" command which can also be
        used to list every line in which a character sequence
        appears. Compare the respective advantages of FIND and
        A_PATTRN.
        <<-≡≡≡≡

═════════════════════════════════════════
5.7  The power of sorting patterns
═════════════════════════════════════════

Suppose we look for patterns around the single "at sign" (@)
character:

     A_PATTRN SVP_TXT @ > AT_SIGN.SVP

The result contains 935 lines which start out as follows:

00000000:    ...@HEAD1 = SAINT VINCENT DE PAUL..@HEAD
00000032:    L..@HEAD2 = CORRESPONDENCE..@HEAD4 = 417
00000057:    E..@HEAD4 = 417. - TO SAINT LOUISE DE MA
00000121:    S..@TEXT4 = Paris, January 11, 1640..@TE
00000155:    0..@TEXT7 = Mademoiselle,..@TEXT6 = I re
00000179:    ,..@TEXT6 = I received three letters fro
00000605:    ...@TEXT6 = Seeing that those Gentlemen<
00001613:    ...@TEXT6 = You would do well to send fo
00001799:    ...@TEXT6 = People are praying to God fo
00001941:    ...@HEAD4 = 418. - TO LOUIS ABELLY,<B^>1

We may have tripped upon the record separators and field
separators that we are looking for. Notice particularly the
pattern @HEAD4 = which is followed by a number.

We have several options at this point. One is to lengthen the key
and re-run the pattern analysis:

     A_PATTRN SVP_TXT @HEAD > AT_HEAD.SVP
and
     A_PATTRN SVP_TXT @TEXT > AT_TEXT.SVP

Alternately, since the earlier listing AT_SIGN.SVP has only
51,425 bytes, we can sort it beginning at column 17:

     SORT /+17 < AT_SIGN.SVP > AT_SIGN.SRT

As we view the sorted result, patterns become very clear. Here is
part of the analysis that I reported after a few more tests with
the A_PATTRN program. At this point, analysis was still
tentative, but it provided a good basis for discussion with the
database provider.

     Analysis of SVP_TXT
     ASCII text with Printer's Codes (Headers)
     February 12, 1992

     The following are tentative interpretations to the Printer's
     Codes embedded in SVP_TXT. Corrections to errors would be
     welcome.

     @HEAD1      database heading, 1 occurrence at beginning
     @HEAD2      database subheading, 1 occurrence "
     @HEAD4      sequence number, 1 per letter
     @TEXT31     dateline
     @TEXT4      dateline, letter from s.v.p.
     @TEXT41     dateline, letter to s.v.p.
                 from other person
     @TEXT5      signature line, s.v.p.
     @TEXT51     signature line, other person
     @TEXT6      paragraph start, letter from s.v.p.
     @TEXT60
     @TEXT600
     @TEXT61     paragraph start, letter from other person
     @TEXT611    address line to s.v.p.
     @TEXT7      salutation from s.v.p.
     @TEXT71     salutation to s.v.p.

     <169>       beginning quote (7X)
     <170>       end quote (7X)
     <197>       dash (23X)
     <B^>1<D>    superscript footnote ref in heading (11X)
     <D>         terminator for other < > symbols (594X)
     <I^>#<D>    superscript footnote ref in heading (59X)
     <M>         emphasis -- bold, highlight, italics? (6X)
     <MI>        emphasis -- bold, etc.? (134X)
     <P10>, <P7>, <P7M>, <P8>, <P8MI>, <P9>
                 pica measures for font size (212X)
     <R>         ?? 18 of 21X <R>i.s.C.M. in signature
     <^>#<D>     superscript footnote ref in text (409X)
     <|>         blank position holder, not to end line (28X)

═══════════════════════════════
5.8  Sorting large files
═══════════════════════════════

As files get larger, the DOS SORT slows down. Sorting the 935
lines (51,425 bytes) in AT_SIGN.SVP took 30 seconds on a 12
megahertz AT clone. As the SORT 64k byte limit is approached,
things fall apart. A description follows for SORT2, a device to
get around the 64k limit. It's not elegant, but it works!

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
 Usage -   sort2 [/r] [/+n] from_file to_file key[s]

           Sorts large ASCII text files using the memory-bound
           DOS SORT routine in multiple passes. /r signifies
           reverse order. /+n specifies a starting column,
           1-999. A key is 1 to 3 characters, used as a dividing
           point. The program separates the input file into a
           series of temporary files, depending on the byte(s)
           at the starting column. For n dividing points, the
           program makes n+1 temporary files, and reports the
           size of each. If all are under 60k characters, they
           are sorted and placed together in the output file.
           If a run fails, add another dividing point mid-way in
           the range that fails (that is, the file that is too
           big), and try again.
           NOTE: The DOS SORT starts column count at 1, converts
           all lower to upper case!

 input:    Line oriented printable ASCII.

 output:   Same file, sorted.

 writeup:  MIR TUTORIAL ONE, topic 5
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

≡≡≡≡->> QUESTION: Try out SORT2, get a feel for how it works. Can
        you come up with ways to make it easier to use or more
        powerful? Or do you have your own super sort that you
        are willing to publish under copyleft rules? <<-≡≡≡≡

Another way to speed up sorts is to throw away portions of the
target file that are not essential for the purpose you have in
mind when sorting. For example, the program A_PATTRN produces 8
byte offsets followed by a colon and white space, and up to 40
bytes of information. Are they all necessary? The program COLRM
removes the same columns from every line of ASCII text in a file:

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
 Usage -   colrm from_col to_col < printable_ascii > revised_ascii

           Removes the specified range of columns from each line
           of an ASCII file. This is a clone of the Unix "colrm"
           utility.

 input:    A printable ASCII file with less than 512 characters
           per line. Columns number from 1 upward.

 output:   The same number of lines, but with one segment of
           columns removed from each line.

 writeup:  MIR TUTORIAL ONE, topic 5
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

Since ASCII text is the only accepted input, we are safe in a DOS
environment to use standard input and output. There is no
confusion over line feeds and CTL-Z characters. An added benefit
is that we can pipe the output of successive runs of COLRM.
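
The column removal itself is a short loop. The C sketch below is
an illustration under the rules stated in the usage box (columns
count from 1, the range is inclusive); the function name
remove_cols is hypothetical and this is not the source of COLRM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, not COLRM's source: copy one line without
 * its columns from..to.  Columns count from 1 and the range is
 * inclusive.  A 'to' beyond the end of the line simply removes
 * everything from 'from' onward.  out must hold strlen(line) + 1
 * bytes and must not overlap line. */
static void remove_cols(const char *line, size_t from, size_t to,
                        char *out)
{
    size_t len, i, n = 0;

    assert(line != NULL && out != NULL);
    assert(from >= 1 && from <= to);
    len = strlen(line);
    for (i = 0; i < len; i++) {
        size_t col = i + 1;            /* 1-based column number */

        if (col < from || col > to)
            out[n++] = line[i];
    }
    out[n] = '\0';
}
```

Because each call only needs the current line, two such filters
can be chained through a pipe without either one holding the
whole file in memory.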

Recall the earlier example

     A_PATTRN SVP_TXT \3C\5E > RESULT

which produced output that started off like this:

00000641:    men<^>2<D> want to communicate..in writi
00001138:    ot]<^>3<D> concern the hospital, you..ca
00001467:    uf,<^>4<D> on Madame..Goussault's<^>5<D>
00001497:    t's<^>5<D> estates, I believe, although
00001678:    eu,<^>6<D> and to do so as soon as possi
00002197:    ne,<^>2<D> which I see from your..letter
00003152:    is;<^>3<D> who put together his council
00003412:    us?<^>4<D> Indeed, there is no good or..
00003747:    ey,<^>5<D> who on another..occasion spok
00005086:    eva<^>6<D> and,..following his example,

Our primary interest is in the patterns of the form <^>#<D> and
<^>##<D>. We could remove the first 16 columns by:

     COLRM 1 16 < RESULT > TEMP

and those that follow the 8 characters of interest by

     COLRM 9 99 < TEMP > RESULT2

Notice that you can use any large number that will reach to the
end of all lines. Alternately, you can do the two steps in one:

     COLRM 1 16 < RESULT | COLRM 9 99 > RESULT2

RESULT2 has only 4,090 bytes, in contrast to the 22,086 in
RESULT. The new file starts off like this:

     <^>2<D>
     <^>3<D>
     <^>4<D>
     <^>5<D>
     <^>6<D>
     <^>2<D>
     <^>3<D>
     <^>4<D>
     <^>5<D>
     <^>6<D>

Eighty per cent reduction in a file size pays off when sorting.

A_OCCUR is useful in analyzing sorted files that contain many
repetitions.

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
 Usage -   a_occur [ min_freq ] [ /n ] < ascii_text > report
           /n = non-sequenced data is okay

           Count the frequency of occurrence of identical lines.
           If a minimum frequency is specified, lines occurring
           fewer times are dropped entirely from the result.

 Input:    ASCII text, which must be in sorted order UNLESS the
           flag "/n" is included.

 Output:   A reduced copy of the file with each line shown only
           once. Each line begins with a frequency count, padded
           out to six characters with blanks.

 Writeup:  MIR TUTORIAL ONE, topic five. See also the related
           programs A_OCCUR2 and A_OCCUR3.
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

Here is the top end of the output when we input the command

     A_OCCUR < RESULT2 > FREQ

     10    <^>10<D>
     8     <^>11<D>
     6     <^>12<D>
     5     <^>13<D>
     5     <^>14<D>
     3     <^>15<D>
     3     <^>16<D>
     3     <^>17<D>
     2     <^>18<D>
     2     <^>19<D>

Frequency is the first element. For example, the pattern <^>11<D>
occurs 8 times, <^>12<D> occurs 6 times. It was the regularly
declining frequency of the numbers that first suggested to me
that these tags indicate footnote numbers within the test file
SVP_TXT.

To finish this topic, we mention two simple utility programs that
are related to A_OCCUR.

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
 usage -   a_occur2 [ min_frequency [ filename_under_min ] ]
                    < merged a_occur files > combined

           A utility to calculate cumulative frequency of merged
           A_OCCUR outputs. If a minimum frequency is specified,
           then all lower frequency items are either suppressed
           or sent to a file named in the next argument.

 Input:    ASCII text, in which each line starts with a number
           (a frequency count) followed by blanks, then sorted
           text starting in the seventh column.

 Output:   A copy of the same file in which multiple identical
           lines are shown only once, preceded by the combined
           frequency count.

 Writeup:  MIR TUTORIAL ONE, topic five.
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
 Usage     a_occur3 < occur_file > expanded_file

           Reverse an A_OCCUR file by removing the initial
           count, then outputting each line the number of times
           indicated by the count. Useful if editing an A_OCCUR
           file, then reconstituting it.

 input:    ASCII file with each line containing a count, blank
           padded to the sixth character, then the line content.

 output:   Same content, but with leading six characters removed
           and content repeated for "count" lines.

 Writeup:  MIR TUTORIAL ONE, topic five.
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

                         * * * * *

We have identified several methods of viewing portions of a
computer file. Each is an aid in analyzing file content. The most
powerful aid is the A_PATTRN program. Its output may be sorted so
that the context of any character or sequence of up to 16
characters may be examined. Interpreting the results becomes
easier as you acquire experience with various kinds of data. The
next few topics offer additional tools and pointers for analysis.
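
As a closing illustration, the counting step that A_OCCUR
performs on sorted input (section 5.8) can be sketched in C. The
function name count_runs is hypothetical and this is not the
program's source; it only shows why sorted order matters, since
identical lines must be adjacent to be counted as one run.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, not A_OCCUR's source: collapse runs of
 * identical adjacent lines.  The input is ASSUMED to be sorted, so
 * that equal lines sit next to each other.  uniq[k] receives a
 * pointer to the first line of each kept run and freq[k] its
 * length; runs shorter than min_freq are dropped.  Both output
 * arrays need room for n entries.  Returns the number of runs
 * kept. */
static size_t count_runs(const char *const *lines, size_t n,
                         long min_freq,
                         const char **uniq, long *freq)
{
    size_t i = 0, kept = 0;

    assert(lines != NULL && uniq != NULL && freq != NULL);
    while (i < n) {
        size_t j = i + 1;

        /* Advance j to the end of the run that starts at i. */
        while (j < n && strcmp(lines[j], lines[i]) == 0)
            j++;
        if ((long)(j - i) >= min_freq) {
            uniq[kept] = lines[i];
            freq[kept] = (long)(j - i);
            kept++;
        }
        i = j;
    }
    return kept;
}
```

On unsorted input the same line can start several separate runs,
which is presumably why the real program asks for the "/n" flag
in that case.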